CST Bank: A Corpus for the Study of Cross-document Structural Relationships

Authors

  • Dragomir R. Radev
  • Jahna Otterbacher
  • Zhu Zhang
Abstract

Clusters of multiple news stories related to the same topic exhibit a number of interesting properties. For example, when documents have been published at various points in time or by different authors or news agencies, one finds many instances of paraphrasing, information overlap and even contradiction. The current paper presents the Cross-document Structure Theory (CST) Bank, a collection of multi-document clusters in which pairs of sentences from different documents have been annotated for cross-document structure theory relationships. We will describe how we built the corpus, including our method for reducing the number of sentence pairs to be annotated by our hired judges, using lexical similarity measures. Finally, we will describe how CST and the CST Bank can be applied to different research areas such as multi-document summarization.
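The abstract mentions using lexical similarity measures to reduce the number of sentence pairs sent to the judges. A minimal sketch of that idea follows. It assumes word-overlap (Jaccard) similarity and an arbitrary threshold of 0.2; the abstract does not name the specific measures or cutoffs used, and the function names `tokens`, `jaccard`, and `candidate_pairs` are hypothetical.

```python
from itertools import product


def tokens(sentence):
    """Return the set of lowercased words in a sentence, with edge punctuation stripped."""
    words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
    return words - {""}


def jaccard(a, b):
    """Word-overlap (Jaccard) similarity between two sentences, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(doc_a, doc_b, threshold=0.2):
    """Keep only cross-document sentence pairs whose similarity meets the threshold.

    The threshold value is an illustrative assumption, not a figure from the paper.
    """
    pairs = []
    for s, t in product(doc_a, doc_b):
        score = jaccard(s, t)
        if score >= threshold:
            pairs.append((s, t, score))
    return pairs


if __name__ == "__main__":
    # Two toy "documents", each a list of sentences.
    doc_a = ["The plane crashed near Milan on Monday.",
             "Officials opened an investigation."]
    doc_b = ["A plane crashed near Milan.",
             "Stock markets rose sharply."]
    for s, t, score in candidate_pairs(doc_a, doc_b):
        print(f"{score:.2f}\t{s}\t{t}")
```

Only pairs that share enough vocabulary survive the filter, so annotators would see far fewer than the full cross product of sentences.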

Similar resources

Cross-document relationship classification for text summarization

Multiple documents describing the same event present some interesting challenges for natural language processing. They contain similar information and yet they also exhibit a number of interesting properties: paraphrases, partial agreement, difference in judgment and emphasis, and contradictions. When the sources track an event that evolves over time, more phenomena can be observed: additions, ...

Combining Labeled and Unlabeled Data for Learning Cross-Document Structural Relationships

Multi-document discourse analysis has emerged with the potential of improving various NLP applications. Based on the newly proposed Cross-document Structure Theory (CST), this paper describes an empirical study that classifies CST relationships between sentence pairs extracted from topically related documents, exploiting both labeled and unlabeled data. We investigate a binary classifier for de...

A Common Theory of Information Fusion from Multiple Text Sources Step One: Cross-Document Structure

We introduce CST (cross-document structure theory), a paradigm for multidocument analysis. CST takes into account the rhetorical structure of clusters of related textual documents. We present a taxonomy of cross-document relationships. We argue that CST can be the basis for multidocument summarization guided by user preferences for summary length, information provenance, cross-source agreement,...

Providing a Model for Evaluating Suspicious Bank Accounts with the Approach of Determining Tax Effects Based on Structural Equation Modeling

The main approach of this study is to provide solutions to managers, economists, and tax auditors. Having a clearer perspective of the transactional relationships that distress the taxpayer transaction tax also helps them choose the best strategy to improve tax revenue. In this paper the fuzzy Delphi method was used to identify the indicators affecting suspicious bank accounts. The data coll...

Evaluation of the Validity of the Competing Sentences Test in Stroke Patients, Loghman Hospital, 1378-79

Background: Cerebrovascular diseases (CVD) are among the most common anomalies that may affect the auditory cortex. In this research we have tried to evaluate the function of the CANS in a group of 50-70-year-old cerebrovascular accident (CVA) patients without hearing problems, using the Persian version of the C.S.T. Materials and Methods: This cross-sectional analytic study was conducted at Loghman-...

Publication date: 2004